Abstract. Statistical distributions with heavy tails are ubiquitous in natural and social phenomena. Since
the entries in the heavy tail have disproportionate significance, knowledge of its exact shape is very important.
Citations of scientific papers form one of the best-known heavy-tailed distributions. Even in this case there
is considerable debate whether the citation distribution follows a log-normal or a power-law fit. The goal
of our study is to resolve this debate by measuring the citation distribution for a very large and homogeneous
data set. We measured the citation distribution for 418,438 Physics papers published in 1980-1989 and cited by
2008. While the log-normal fit deviates too strongly from the data, the discrete power-law function with the
exponent $\gamma = 3.15$ does better and fits 99.955% of the data. However, the extreme tail of the distribution
deviates upward even from the power-law fit and exhibits a dramatic "runaway" behavior. The onset
of the runaway regime is revealed macroscopically as the paper garners 1000-1500 citations; however, the
microscopic measurements of autocorrelation in citation rates are able to predict this behavior in advance.
1 Introduction

Dynamic statistical distributions found in nature frequently
exhibit heavy tails that are usually approximated using
log-normal or power-law functions [1]. However, there can
be a few individual values in the extreme end of the distribution
that exceed both of these fitting functions by far. We
advance the hypothesis that these "runaways" can eventually
capture the entire system. The statistical distribution
of citations of scientific papers offers a good example of
such dynamics.

Intensive studies of citation statistics were triggered
by the seminal work of de Solla Price [2], who showed that
the statistical distribution of citations of scientific papers
has a heavy tail that can be approximated by a power-law
function. De Solla Price proposed a microscopic "cumulative
advantage" mechanism [3] that in the long-time
limit generates such a power-law distribution. Following his
line of thought, subsequent studies of citation statistics
[4,5,6,7,8,9] used power-law probability distributions to
fit their data. Later on, it turned out that stretched exponential
[10,11,12] or log-normal [13,14,15,16] functions
fit the measured citation distributions equally well. Ref.
[17] showed that the choice between the power-law and
the log-normal fits is extremely difficult. Indeed, the most
important difference between the two is that the power law
decays more slowly than the log-normal; in other words, it
has a heavier tail. To probe this tail one has to measure
a very large data set in order to have enough entries
in the tail. Previous studies of citations of scientific papers
used either small [14,15] or inhomogeneous [13] data
sets that were insufficient to distinguish between the
log-normal and power-law fits.

The initial goal of our study was to find out which
function better fits the citation distribution in the long-time
limit: the power law or the log-normal. To achieve
this goal we measured the citation distribution for a very large
and homogeneous data set containing almost all Physics
papers published in 1980-1989 and cited by 2008. Surprisingly,
we found that (i) the citation distribution is not
stationary even 25 years after publication, and (ii) both
the log-normal and the power-law fitting functions fail to describe
the extreme tail of the distribution, since it develops a
runaway behavior.



2 Theoretical framework 

In the long-time limit, de Solla Price's cumulative
advantage mechanism [3] generates the following citation
distribution
Here, k is the number of citations, $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ is
the beta function, $\Gamma$ is the gamma function, and $w$, $\gamma$ are
parameters. Eq. (1) is known as the Waring [18,19,20] or
discrete power-law distribution since for $k \gg w$ it asymptotically
approaches the power-law dependence, $p(k) \sim (w/k)^{\gamma}$.
The continuous approximation of Eq. (1),



$$p(k) \approx (\gamma - 1)\,\frac{w^{\gamma-1}}{(w + k)^{\gamma}} \qquad (2)$$



is known as the Zipf-Mandelbrot or Pareto type II distribution
and was also used in citation analysis [10,20,22].
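
The relation between the discrete power-law distribution and its continuous approximation, Eq. (2), can be checked numerically. The sketch below is illustrative only. It assumes the common Waring parameterization $p(k) = B(k + w, \gamma)/B(w, \gamma - 1)$, which is normalized over $k \geq 0$ and decays as $k^{-\gamma}$; other parameterizations exist, and the function names are hypothetical. The parameter values are those of the discrete power-law fit quoted for the 1980-1989 data set ($\gamma = 3.15$, $w = 27.5$).

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def waring_pmf(k, w, gamma):
    """ASSUMED Waring form: p(k) = B(k + w, gamma) / B(w, gamma - 1), k = 0, 1, 2, ..."""
    return math.exp(log_beta(k + w, gamma) - log_beta(w, gamma - 1.0))


def pareto2_pdf(k, w, gamma):
    """Continuous approximation, Eq. (2): (gamma - 1) w**(gamma - 1) / (w + k)**gamma."""
    return (gamma - 1.0) * w ** (gamma - 1.0) / (w + k) ** gamma


if __name__ == "__main__":
    w, gamma = 27.5, 3.15  # discrete power-law fit parameters quoted in the text
    total = sum(waring_pmf(k, w, gamma) for k in range(50000))
    print(f"sum of the assumed pmf over k < 50000: {total:.8f}")
    for k in (0, 10, 100, 1000, 10000):
        ratio = waring_pmf(k, w, gamma) / pareto2_pdf(k, w, gamma)
        print(f"k = {k:6d}  discrete/continuous = {ratio:.4f}")
```

Working with `math.lgamma` rather than the gamma function itself avoids overflow at large k. Under the assumed form, the two expressions differ by a few percent over the whole range, so either can serve as the power-law reference curve.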

Several recent works [13,14,15,16] fitted the measured citation
distribution with the log-normal
function that was modified to describe such a non-negative
and discrete variable as citations. Although the functions
represented by Eqs. (1) and (3) are dramatically different, the fits
of the citation distributions based on Eq. (1) and on the discrete
version of Eq. (3) are virtually indistinguishable for a
wide (but finite!) range of values. Thus, discrimination between
the discrete log-normal and the discrete power-law
distributions is quite ambiguous [17,23]. This ambiguity
is illustrated by Fig. 1, which shows the cumulative probability
distribution of citations to 353,268 Physics papers. The
data lie just in between the log-normal cdf and the discrete
power-law cdf.
Fig. 1. Cumulative probability distribution (cdf) of citations
to 353,268 papers published in Physical Review journals during
1893-2003 and cited by 2003. Only PR to PR citations were
counted. The data were adapted from Ref. [15]. The continuous
red line shows a fit with the discrete power-law cdf (Eq. (1)) with
$\gamma = 3.15$, $w = 10.2$. The dashed blue line shows a fit with the
log-normal cdf (Eq. (3)) with $\mu = 1.15$, $\sigma = 1.42$.



The solution of the dilemma "power-law or log-normal"
comes from an unexpected direction. Our measurements
demonstrate that (i) the citation distribution follows neither
the power-law nor the log-normal fit, (ii) it is non-stationary,
and (iii) the individual values in the tail grow at a much
faster rate than the rest of the distribution, indicating the
runaway effect. As the sample size and the evolution time
increase, the runaways become more prominent. The contribution
of non-stationarity to the emergence of distributions
with heavy and super-heavy tails has been studied in
the context of city populations, wealth distribution, and
growing networks [24,25,26].



3 Citation Distribution for Physics Papers 

We considered research papers in Physics that were published
in 82 leading physics journals in the period from
1980 to 1989 (excluding the overview papers and papers
published in the popular science journals), 418,438 papers
in total. We measured the number of citations gained by
each paper by July 2008. Since the citation life of an ordinary
paper rarely exceeds 15 years and the time span
between the publication and the citation count for the papers
in our data set is 20-28 years, we expected the citation
distribution for this data set of "old" papers to be stationary.

Figure 2 displays this distribution. The log-normal cdf
fits the data only for the papers with fewer than 400 citations
(these constitute 99.7% of all papers), while the tail
deviates upward. The discrete power-law cdf fits the data
in Fig. 2 much better. However, it also fails to describe the
extreme tail of the distribution that contains the papers
with more than 1000 citations (0.04% of all papers). This
is most dramatically illustrated in Fig. 3, where we plotted
the ratio of the measured number of citations to the
fitted values using the log-normal cdf and the discrete power-law
cdf. The numbers of citations for all the 190 papers that
have more than 1000 citations significantly and increasingly
exceed both the power-law and the log-normal fits. It
is instructive to compare the onset of this super-heavy or
"runaway" tail to the mean number of citations, m = 24.
Figures 2 and 3 indicate that the "runaways" become visible
at k/m = 20-40 and extend at least to k/m = 1000.
The previous smaller-scale studies [14,15] were limited to
k/m < 40 and did not provide enough statistics to probe
the "runaways".

We conclude that the citation distribution for small homogeneous
data sets (~1000 papers) can be fitted equally
well by the log-normal and by the discrete power-law functions.
Larger sets (~10,000 papers) are better fitted by the
discrete power-law cdf. The runaway tail that exceeds even
the power-law fit becomes apparent when the number of
papers in the set exceeds 10,000. In what follows we focus
on the microscopic mechanism responsible for runaways.
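
The comparison of measured and fitted cumulative distributions described in this section can be sketched as follows. This is a minimal illustration, not the analysis code used for the figures. It assumes that the plotted cumulative distribution is the fraction of papers with at least k citations, and it uses the tail probability of the continuous approximation, Eq. (2), namely $(w/(w + k))^{\gamma - 1}$, as the power-law reference. The function names are hypothetical.

```python
from collections import Counter


def empirical_ccdf(citations):
    """Map each observed count k to the fraction of papers with at least k citations."""
    n = len(citations)
    counts = Counter(citations)
    ccdf = {}
    remaining = n  # number of papers with at least the current k
    for k in sorted(counts):
        ccdf[k] = remaining / n
        remaining -= counts[k]
    return ccdf


def pareto2_ccdf(k, w, gamma):
    """Tail probability obtained by integrating Eq. (2) from k to infinity."""
    return (w / (w + k)) ** (gamma - 1.0)


def ratio_to_fit(citations, w, gamma):
    """Ratio of the measured tail probability to the fitted one, for each observed k."""
    return {k: p / pareto2_ccdf(k, w, gamma)
            for k, p in empirical_ccdf(citations).items()}
```

A ratio that stays close to 1 indicates a good fit, while a ratio that grows with k in the tail is the signature of the runaway behavior discussed above.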

Fig. 2. Cumulative distribution function (cdf) of citations to
418,438 papers published in 82 leading physics journals in
1980-1989 and cited by July 2008. The horizontal axis is k + 1
instead of k in order to show uncited papers, k = 0, on the
log scale. The dashed blue line shows a fit with the log-normal
cdf (Eq. (3)) with $\mu = 2.16$ and $\sigma = 1.38$. The deviation from the
fit appears already at k > 400. The continuous red line shows
a fit with the discrete power-law cdf (Eq. (1)) with $\gamma = 3.15$ and
$w = 27.5$. This function fits the data up to k = 1000. However,
the tail of the distribution with k > 1000 deviates upwards
even from this power-law fit, indicating "runaway" papers.

Fig. 3. The ratio of the measured cumulative probability of
citations (Fig. 2) to the discrete power-law cdf (red circles) and
to the log-normal cdf (blue circles). The log-normal fit fails
already for k > 400. The discrete power law provides a good
fit for k < 1000, while a pronounced deviation appears for
k > 1000.

4 Microscopic mechanisms of citation dynamics that are able to produce the runaway behavior

1. Superlinear Yule-Simon process. The most accepted
microscopic mechanism of citation dynamics is
preferential attachment, also known as the cumulative
advantage, Gibrat, or Yule process [5,27,28,29]. This
mechanism assumes that the citation process is a memoryless
Markov chain in which the citation rate of a paper
is determined by the total number of accumulated
citations k and the time after publication t,

$$\frac{dk}{dt} = A(t)\,(k + k_0)^{\alpha} \qquad (4)$$

Here, $A$ is the aging parameter, $\alpha$ is the attachment
exponent, and $k_0$ is the initial attractiveness. Refs.
[28,29] showed that linear preferential attachment,
$\alpha = 1$, yields the discrete power-law citation distribution
(Eq. (1)), while superlinear preferential attachment,
$\alpha > 1$, results in runaway behavior where a
few papers contain a finite fraction of all citations.
2. Stochasticity. The heavy-tailed distribution can result
from the stochastic and non-stationary character of
the dynamic variable [24,25,30]. For a dynamic variable
with a lower bound, such as citations, the stochastic
mechanism yields a power-law distribution whose exponent
is the ratio between the average growth rate
and its individual fluctuations [25,26]. Finite-size
effects may affect the power-law exponent to the extent
of concentrating the entire distribution in just a
few individual entries at the end of the heavy tail [24,31].
Hence, stochasticity can partially explain the
power-law citation distribution and the runaway behavior.
3. Memory. If the assumption of a memoryless Markov
chain that stands behind Eq. (4) is lifted (in other words,
the citation process has some memory), this can result in
a distribution with a runaway tail. Even a weak
memory of the past dynamics and its feedback on the
rate of the multiplicative random walk may lead to a
divergent statistical distribution and runaways [32].

To verify which of these mechanisms shape the citation
distribution, we measured the citation dynamics of individual
papers. We expected that once the microscopic dynamics
is uncovered, we will be able to explain the "runaway tail".
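
The difference between linear and superlinear preferential attachment (mechanism 1 above) can be illustrated with a toy simulation. The sketch below is not the simulation reported in this paper. It is a simplified discrete growth model that drops the aging factor $A(t)$ of Eq. (4), adds papers one at a time, and lets each new paper cite earlier papers with probability proportional to $(k + k_0)^{\alpha}$. The function names, the number of references per paper, and the sampling with replacement are illustrative assumptions.

```python
import random


def simulate_citations(n_papers=1000, refs_per_paper=5, k0=1.0, alpha=1.0, seed=0):
    """Toy growth model: a new paper cites earlier ones with weight (k + k0)**alpha."""
    rng = random.Random(seed)
    citations = [0]  # the first paper starts uncited
    for _ in range(1, n_papers):
        weights = [(k + k0) ** alpha for k in citations]
        n_refs = min(refs_per_paper, len(citations))
        # Simplification: sampling with replacement, so a target may be cited twice.
        for target in rng.choices(range(len(citations)), weights=weights, k=n_refs):
            citations[target] += 1
        citations.append(0)  # the new paper enters uncited
    return citations


def top_share(citations):
    """Fraction of all citations held by the single most-cited paper."""
    return max(citations) / sum(citations)


if __name__ == "__main__":
    for alpha in (1.0, 2.0):
        cites = simulate_citations(alpha=alpha)
        print(f"alpha = {alpha}: top paper holds {top_share(cites):.1%} of citations")
```

In this toy model a strongly superlinear exponent concentrates citations in a single paper, whereas the linear case spreads them over a broad heavy-tailed distribution. The exponent of 2 is chosen only to make the contrast visible in a small sample.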



4.1 Microscopic measurements of citation dynamics of 
individual papers 

In contrast to the measurements shown in Fig. 2, where
we measured the aggregated citation distribution for the 418,438
Physics papers published during a decade (1980-1989)
and did not trace the citation dynamics of individual papers,
here we considered the Physics papers published in one
year (1984) and analyzed the citation history of each paper
up to July 2008. This data set contains a smaller number
of entries (40,195) but it is age-homogeneous.
4.1.1 Divergence of citation dynamics of similar papers

As an example, from all Physics papers published in 1984
we chose the subset containing all those papers that garnered
30-31 citations by the end of 1986. Figure 4 displays the
citation dynamics of these 89 papers. Although they represent
an extremely homogeneous subset (the same field,
the same publication year, the same citation prehistory),
the divergence in their citation behavior is striking.











[Figure 4: plot of the number of citations versus time (yrs); inset label: k = 30-31, T = 3]

Fig. 4. Citation dynamics of 89 Physics papers published in
1984. We chose all those papers that by 1986 (three years after
publication) had 30 or 31 citations. Although the initial
citation dynamics of these papers is very similar, it quickly
diverges in such a way that after 25 years (in 2008) the number
of citations varies between 40 and 2254.

We ranked the papers according to the number of citations
garnered by 2008. The list opens with the pair of
papers: "Lower critical dimension of the random-field Ising
model: a Monte-Carlo study" by D. Andelman, H. Orland,
and L.C.R. Wijewardhana, Phys. Rev. Lett. (40 citations);
and "Fractional quantum Hall-effect at filling factors up
to $\nu = 3$" by G. Ebert, K. von Klitzing, and J.C. Maan,
J. of Physics C: Solid State (49 citations). The list ends
with the papers: "Dynamics of supercooled liquids and
the glass transition" by U. Bengtzelius, W. Götze, and A.
Sjölander, J. of Physics C: Solid State (798 citations); and
"Embedded-atom method: Derivation and application to
impurities, surfaces, and other defects in metals" by M.S.
Daw and M.I. Baskes, Phys. Rev. B (2254 citations). All
four papers were published in practically the same journals
in the same year and have the same citation prehistory.
The two former papers are important works with
typical citation dynamics. The two latter papers exhibit
strongly different citation dynamics, and the last one is a
clear runaway. S. Redner [13] already demonstrated that
the citation history of the citation classics is strongly
individualized. Our analysis substantiates this observation
and extends it to all papers.



4.1.2 Analysis of the citation dynamics 

We analyzed the citation dynamics of all 40,195 Physics papers
published in 1984 using the framework of Eq. (4). We
found $\alpha = 1.2$-$1.28$, $k_0 = 1.1$, and $A = 3.5/(t + 0.3)^2$.
Since the deviation from linearity is small, $\alpha - 1 \ll 1$,
and the aging parameter $A$ strongly decays with time, the
runaway behavior generated by Eq. (4) is too slow. We conclude
that while nonlinear preferential attachment is
the dominant mechanism that shapes the observed power-law
citation distribution, it cannot produce runaways.

As is clearly seen from Fig. 4, the fluctuations of the
citation rate of individual papers (the ripple on the continuous
lines) are small. Therefore, while stochasticity
affects the shape of the citation distribution, its contribution
is not dominant. Nor can it be responsible for
runaways.

The strong systematic differences between the citation
behavior of similar papers shown in Fig. 4 are at odds with
the preferential attachment paradigm, which assumes similar
citation dynamics for papers having the same past
number of citations. As borne out by the data, there are
strong correlations between the citation rates of the same
paper in different years, i.e., the citation process has memory
[33]. As we show below, the memory dominates the
dynamics of heavily cited papers and turns the citation
process into a de facto deterministic one, at least for the
runaway papers. Indeed, Fig. 4 shows that the number of
citations for most of the papers comes to saturation
after 15 years. However, some papers continue to
be cited at an undiminished rate even after 25 years. It is
clearly seen that the citation behavior of these runaway
papers is not erratic: the citation rates in subsequent years
are strongly correlated.

4.1.3 Autocorrelation 

To characterize the correlations we chose sets of similar
papers and measured the autocorrelation between the citation
rates of subsequent years. To this end we considered
sets of papers that were published in the same year
and that garnered the same number of citations, k, after
t years. For each paper i we found the number of citations
acquired during two subsequent years, $\Delta k_i(t)$ and
$\Delta k_i(t-1)$, and calculated the Pearson autocorrelation
coefficient,

$$c_{t,t-1} = \frac{\overline{\left(\Delta k_i(t) - \overline{\Delta k_i(t)}\right)\left(\Delta k_i(t-1) - \overline{\Delta k_i(t-1)}\right)}}{\sigma_t\,\sigma_{t-1}} \qquad (5)$$

Here $\sigma_t$, $\sigma_{t-1}$ are the standard deviations of the $\Delta k_i(t)$ and
$\Delta k_i(t-1)$ distributions, respectively, and the averaging
is performed over all papers in the set. Figure 5 shows
that for moderately cited papers the autocorrelation is
weak, $c_{t,t-1} \ll 1$, as expected for a memoryless process.
However, the papers with more than 1000 citations have
$c > 0.9$. This means that their citation behavior is almost
deterministic. While the citation rate of most of
the papers decays with time, these papers continue to be
highly cited and eventually develop into runaways. (Indeed,
the onset of the runaway tail in Fig. 3 occurs at the
same number of citations, k = 1000, at which the autocorrelation
coefficient approaches unity.) In the long run, the
citation rate of these runaways shall nevertheless decrease
due to yet another mechanism: these prominent papers
eventually become common knowledge and need not be
cited anymore.

Fig. 5. The Pearson autocorrelation coefficient for additional
citations (Eq. (5)). t is the number of years after publication.
$c_{t,t-1}$ steadily increases with the number of previous citations
k, and exceeds 0.9 for the papers with more than 1000 citations.

4.1.4 Numerical simulations

Our further studies of citation statistics (not shown here)
combined correlation measurements with other parameters
of the correlated Poisson process governing the citation
dynamics of each paper, namely, the stochastic and deterministic
parts of Eq. (4). We performed numerical simulations
of citation dynamics based on these measurements,
without any additional parameters, and found the aggregated
citation distribution. The simulations developed a
runaway population of papers consistent with that observed
in Fig. 2. The simulations based only on the stochastic
version of Eq. (4) (without correlations) yielded a
discrete power-law citation distribution and no runaways.

5 Discussion

Unlike other fields where "dragon king" events have a negative
emotional connotation and are associated with catastrophes
and crashes, the runaways in citation statistics
have a positive connotation and indicate very important papers
in the field. The social and epistemological implications
of our study might contribute to the current effort
of evaluating and predicting scientific production in terms
of citations.

Our results are also important in the broader context of
the dynamics of complex systems. Our citation statistics
results add to the relatively small number of
systems for which the microscopic elementary laws were
uncovered and whose macroscopic consequences (obtained
theoretically and by simulation) successfully passed the
confrontation with the empirically observed macroscopic
phenomena [34,35,36].

The present study should be viewed as a part of the
more generic effort of understanding autocatalytic dynamics
as the bridge that allows the promotion of simple microscopic
interactions to complex, macroscopic collective
phenomena.

6 Conclusions

1. The aggregate citation distribution for the set of "old"
Physics papers is better fitted by the discrete power-law
(Waring) distribution than by the log-normal
distribution.

2. Both the power-law and the log-normal fits severely underestimate
the extreme tail of the citation distribution that
contains papers with more than a few thousand citations
(~0.05% of all papers). These "runaways" exhibit significant
autocorrelation in their citation rate (the Pearson
autocorrelation coefficient exceeds 0.9), in such a
way that it becomes deterministic.

3. In the long run, the citation distribution for an age- and
field-homogeneous set of papers reveals two populations:
ordinary papers whose citations come to saturation
after 15-20 years and "runaways" that continue
to be cited even after 20 years.
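
As an illustration of the autocorrelation measurement that underlies these conclusions, the Pearson coefficient of Eq. (5) can be sketched as follows. This is a minimal sketch rather than the analysis code of this study. The function name and the input format (two equal-length lists holding, for each paper in a set, the citations gained in year t and in year t-1) are assumptions.

```python
import math


def pearson_autocorrelation(dk_t, dk_prev):
    """Eq. (5): Pearson coefficient between citation gains in years t and t-1.

    dk_t[i] and dk_prev[i] are the citations gained by paper i in two subsequent years.
    """
    if len(dk_t) != len(dk_prev) or len(dk_t) < 2:
        raise ValueError("need two equal-length sequences with at least two papers")
    n = len(dk_t)
    mean_t = sum(dk_t) / n
    mean_prev = sum(dk_prev) / n
    # Averages over all papers in the set, as in the definition of Eq. (5).
    cov = sum((a - mean_t) * (b - mean_prev) for a, b in zip(dk_t, dk_prev)) / n
    var_t = sum((a - mean_t) ** 2 for a in dk_t) / n
    var_prev = sum((b - mean_prev) ** 2 for b in dk_prev) / n
    if var_t == 0.0 or var_prev == 0.0:
        return float("nan")  # undefined when one of the years shows no variation
    return cov / math.sqrt(var_t * var_prev)
```

A value close to 1 for a set of papers means that a paper's citation gain in one year nearly determines its gain in the next, which is the behavior reported here for papers with more than 1000 citations.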